Similarity Measures for Smooth Web Page Classification

Authors

  • Nikolay Jetchev
  • Stefan Ulbrich
Abstract

This thesis examines the application of consistency learning techniques to the classification of hyperlinked web pages. Several similarity measures between web pages are defined, using local features such as textual content together with features of the linked pages, treated as a graph. The pairwise similarities are collected in similarity matrices, each of which can be used with consistency learning methods to make the classification smooth with respect to the data structure that the matrix reveals, improving the accuracy of a simple text classifier. To further improve performance on the primary task of web page classification, a secondary machine learning problem is defined: finding the optimal combination of similarity matrices. Results on several hyperlinked text collections, including the well-known WebKB collection, show significantly better accuracy for the smooth learning methods than for plain text classification. The main novel contributions of this thesis are the definition and testing of various similarity measures between web pages, and the construction of a locally flexible similarity measure from heterogeneous data sources that improves classification accuracy on each of them. These ideas may also be applied, with little modification, in domains other than web page classification, such as bioinformatics and citation graph classification.
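
As a rough illustration of how a similarity matrix can smooth the output of a classifier, the sketch below uses one common consistency-style formulation: label spreading over a symmetrically normalized similarity matrix. The abstract does not state which formulation or combination scheme the thesis uses, so the function names (normalize, combine, propagate), the fixed combination weights, the value of alpha, and the toy matrices are all assumptions made for illustration only.

```python
import math


def normalize(W):
    """Return S = D^{-1/2} W D^{-1/2} after zeroing the diagonal of W."""
    n = len(W)
    W = [[0.0 if i == j else float(W[i][j]) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[i] * W[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def combine(matrices, weights):
    """Weighted sum of similarity matrices (weights are hypothetical, fixed by hand)."""
    n = len(matrices[0])
    return [[sum(w * M[i][j] for w, M in zip(weights, matrices)) for j in range(n)]
            for i in range(n)]


def propagate(W, Y, alpha=0.9, iters=200):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y and return the arg-max class per page."""
    S = normalize(W)
    n, c = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][k] * F[k][j] for k in range(n)) + (1 - alpha) * Y[i][j]
              for j in range(c)] for i in range(n)]
    return [max(range(c), key=lambda j: F[i][j]) for i in range(n)]


# Toy data (invented): four pages, two similarity sources.
text_sim = [[1.0, 0.8, 0.1, 0.0],
            [0.8, 1.0, 0.2, 0.1],
            [0.1, 0.2, 1.0, 0.7],
            [0.0, 0.1, 0.7, 1.0]]
link_sim = [[0, 1, 0, 0],
            [1, 0, 0, 0],
            [0, 0, 0, 1],
            [0, 0, 1, 0]]

# Page 0 is labeled class 0, page 3 is labeled class 1, pages 1 and 2 are unlabeled.
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

W = combine([text_sim, link_sim], [0.5, 0.5])
labels = propagate(W, Y)
print(labels)
```

Here the combination weights are simply set to 0.5 each; the abstract describes learning the combination as a secondary problem, which this sketch does not attempt.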

Similar resources

A Comparative Study of Web-pages Classification Methods using Fuzzy Operators Applied to Arabic Web-pages

In this study, a fuzzy similarity approach for Arabic web page classification is presented. The approach uses a fuzzy term-category relation by manipulating the membership degree for the training data and the degree value for a test web page. Six measures are used and compared in this study. These measures include: Einstein, Algebraic, Hamacher, MinMax, Special case fuzzy and Bounded Difference ap...

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed that uses the PageRank algorithm to select the optimal subset of features and to compute their weights for web page classification. To evaluate the proposed approach, multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...

An adaptive neural network approach to hypertext clustering

The WWW is an on-line hypertextual collection, and a more sophisticated algorithm for Web page clustering may have to be based on combined term-similarity and hyperlink-similarity measures. It has been observed that nearly all currently employed techniques for document classification on the Web make use of textual information only. In addition, most of these techniques are incapable of discover...

Semantic similarity based web document classification using support vector machine

With the rapid growth of information on the World Wide Web (WWW), classification of web documents has become important for efficient information retrieval. The relevance of retrieved information can also be improved by considering semantic relatedness between words, which is a basic research area in the fields of natural language processing, intelligent retrieval, document clustering and classification,...

Impact of Similarity Measures on Web-page Clustering

Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possi...

Publication date: 2007